Abstract

This report describes the lessons learned using the Indri
search system during the 2004-2006 TREC Terabyte
Tracks. We provide an overview of Indri, and, for the ad
hoc and named page finding tasks, discuss our general
approach to the problem, what worked, what did not work,
and what could possibly work in the future.

1  Introduction 

The Terabyte Track consists of three tasks: the efficiency
task, the ad hoc retrieval task, and the named page finding
task. In our previous TREC reports, we described only our
official submissions [7, 8]. However, many additional
methods were tried. Some of these methods produced poor
results or failed to improve effectiveness. Others showed
promise, but were never fully investigated due to time
constraints. In this report, we first provide an overview
of the Indri retrieval system [10] and then summarize the
outcomes of our experiments using Indri by describing
those methods that worked, those that did not, and those
that potentially could work for the ad hoc and named page
finding tasks.

2  System  Overview 

The  Indri  retrieval  system  was  built  to  evaluate  complex 
structured  queries  on  large  corpora.  Development  began 
in  October  2003,  and  the  system  was  finished  just  in  time 
to  run  experiments  for  the  first  Terabyte  Track  in  2004. 


Indri was originally meant to be a small modification to
the Lemur Project code. There were enough scalability
issues with the original Lemur code that we built an
almost entirely new system. The Indri system is now
distributed as a component of the new, larger Lemur
toolkit¹. It is open source and freely available for
download.

Our first goal with Indri was to create a platform for
experimenting with ranking strategies for large collections.
We worked to balance the competing goals of efficiency,
effectiveness and flexibility. Our requirements for
flexibility led us to adopt a document-at-a-time scoring
strategy. This strategy allowed us to experiment with large
collections and complex queries on systems with very
little memory. We also chose to build indexes with position
information, pseudo-relevance feedback structures, and a
compressed copy of the collection built in. These large
indexes enabled us to quickly experiment with the retrieval
models outlined in this paper. These choices meant that
our system was not particularly interesting in the
efficiency task, especially in 2005 and 2006 as other teams
began to hone the efficiency of their systems.

We added multithreading to our system in 2005, which
significantly improved our query performance. Our own
experiments showed impressive improvements in speed
with an additional thread on a single CPU system, and
a near doubling in throughput in distributed mode.
However, in the 2005 efficiency track we were not allowed to
run multiple queries simultaneously in official runs. We
did not participate in the 2006 efficiency track, which
allowed for parallelism with multiple query streams.


1  http://www.lemurproject.org/indri 
Indri was built from the start to support querying across
a cluster of machines, with results that are guaranteed to
be the same as if all the documents were stored in a
single index. We achieve this by sharing collection
statistics with all nodes in the cluster. The first phase of
each query is a collection statistics phase, where the query
broker requests statistics from each of the query processing
nodes. The query is then processed using the gathered
statistics. When processing queries sequentially, a 6-node
cluster achieved about three times the query throughput of
a single machine. By using two threads, the same 6-node
cluster achieved about 4.5 times the query throughput of
a single machine.
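
The two-phase protocol described above can be illustrated with a
small sketch. This is not Indri's implementation: the function
names, the Dirichlet-smoothed query likelihood scorer, the value of
mu, and the add-0.5 collection estimate are all assumptions made for
illustration. The point is only that, once every node scores with
the same merged statistics, the ranking matches that of a single
index.

```python
import math
from collections import Counter

def shard_stats(shard, terms):
    """Phase 1 (per node): local term counts and total token count."""
    counts, total = Counter(), 0
    for doc in shard.values():
        total += len(doc)
        for tok in doc:
            if tok in terms:
                counts[tok] += 1
    return counts, total

def merge_stats(per_shard):
    """Broker: sum local statistics into collection-wide statistics."""
    counts, total = Counter(), 0
    for c, n in per_shard:
        counts.update(c)  # Counter.update adds counts
        total += n
    return counts, total

def score_shard(shard, terms, counts, total, mu=2500.0):
    """Phase 2 (per node): query likelihood using the global statistics."""
    scores = {}
    for doc_id, doc in shard.items():
        tf = Counter(doc)
        s = 0.0
        for t in terms:
            # Assumption: add-0.5 estimate so unseen terms do not give log(0).
            p_coll = (counts[t] + 0.5) / (total + 1.0)
            s += math.log((tf[t] + mu * p_coll) / (len(doc) + mu))
        scores[doc_id] = s
    return scores

def distributed_search(shards, query):
    """Gather statistics from all shards, then score every shard with them."""
    terms = query.split()
    stats = [shard_stats(s, set(terms)) for s in shards]
    counts, total = merge_stats(stats)
    merged = {}
    for s in shards:
        merged.update(score_shard(s, terms, counts, total))
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))

docs = {
    "d1": "indri search engine".split(),
    "d2": "terabyte track search search".split(),
    "d3": "lemur toolkit".split(),
    "d4": "indri indri query language".split(),
}
single = distributed_search([docs], "indri search")
split = distributed_search(
    [{k: docs[k] for k in ("d1", "d2")}, {k: docs[k] for k in ("d3", "d4")}],
    "indri search",
)
print(single == split)
```

Without the statistics phase, each node would smooth against its own
local term frequencies and the merged ranking could differ from the
single-index ranking.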

3  Ad  Hoc  Task 

For the ad hoc retrieval task, our strategy was to get as
much as possible out of the Indri query language. The
query language provides support for term proximity
operators, such as ordered and unordered windows, synonyms,
matching based on document structure constraints,
document priors, and the ability to assign weights to various
query language constructs. Therefore, our goal was to find
the best way of automatically transforming a TREC topic
into a complex structured Indri query. We now briefly
describe the lessons learned from the various formulations
and strategies that we tried.

3.1  What  Worked 

3.1.1  Term  Proximity 

One of the most significant and consistent improvements
in effectiveness that we observed came from using term
proximity operators. However, as will become apparent
shortly, blind application of term proximity operators
does not work particularly well. Instead, we found that
one specific term proximity formulation, Metzler’s
dependence model formulation, consistently improved
effectiveness over a bag of words baseline [6].

Given a query, the dependence model is essentially a
feature expansion mechanism. The original query is
expanded to include exact phrase (#1) and unordered
window (#uwN) features. Two very important parts of this
formulation, which are often overlooked or not present
in similar models, are feature weighting and feature
smoothing. Feature weights are learned by directly
maximizing mean average precision via hill-climbing. For
feature smoothing, we found that it is valuable to apply
different amounts of smoothing to single term features and
proximity features [5].
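
As a rough illustration of this feature expansion, the sketch below
builds an Indri query string that combines single term, exact phrase
(#1), and unordered window (#uwN) features. Several details are
assumptions rather than the formulation we ran: the helper name
dependence_query is hypothetical, only adjacent term pairs are used
(a sequential variant), the window width is fixed at four positions
per term, and the three weights are placeholders for values that
would in practice be learned by hill-climbing on MAP.

```python
def dependence_query(terms, weights=(0.85, 0.10, 0.05), window_per_term=4):
    """Expand a bag-of-words query with #1 and #uwN proximity features."""
    if len(terms) < 2:
        # No term pairs exist, so fall back to the plain query.
        return "#combine( " + " ".join(terms) + " )"
    # Assumption: only adjacent pairs of query terms are treated as dependent.
    pairs = [terms[i:i + 2] for i in range(len(terms) - 1)]
    unigrams = " ".join(terms)
    ordered = " ".join("#1( " + " ".join(p) + " )" for p in pairs)
    n = window_per_term * 2  # unordered window width for a two-term feature
    unordered = " ".join("#uw%d( %s )" % (n, " ".join(p)) for p in pairs)
    w_t, w_o, w_u = weights  # illustrative values, not learned ones
    return (
        "#weight( %.2f #combine( %s ) %.2f #combine( %s ) %.2f #combine( %s ) )"
        % (w_t, unigrams, w_o, ordered, w_u, unordered)
    )

print(dependence_query(["terabyte", "track", "topics"]))
```

Applying different smoothing to the single term and proximity
features is a retrieval-time setting and is not shown in the sketch.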

The results in Table 1 compare our term proximity
formulation (DM-LM) with a standard bag of words language
modeling-based approach (QL). For the entire set
of Terabyte Track topics, the term proximity formulation
outperforms the bag of words approach by 8.2% in terms
of mean average precision.
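
For reference, the sketch below shows how mean average precision is
commonly computed and how the relative improvement quoted above
follows from the all-topics MAP values in Table 1 (0.3126 for QL and
0.3383 for DM-LM). The function names are hypothetical, and the
sketch is a simplified stand-in for a standard evaluation tool.

```python
def average_precision(ranked, relevant):
    """Average of precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """Mean of per-topic average precision over all judged topics."""
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)

def relative_gain(baseline, improved):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline

# All-topics MAP values reported in Table 1: QL(T) and DM-LM(T).
print(round(relative_gain(0.3126, 0.3383), 1))  # -> 8.2
```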

3.1.2  Query  Expansion 

We found query expansion to be another valuable
strategy. For query expansion purposes, we use a
technique that generalizes Lavrenko’s relevance models [4] to
work with the useful term proximity features described in
the previous section.
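
To make the idea concrete, the sketch below estimates a simplified
unigram relevance model from a set of top-ranked documents and folds
the resulting terms back into the query. It does not include the
generalization to term proximity features described above, and the
function names, the interpolation weight, and the use of unsmoothed
maximum likelihood document models are assumptions for illustration.

```python
import math
from collections import Counter

def relevance_model(feedback_docs, doc_scores, num_terms=10):
    """Estimate P(w|R) as sum over documents of P(w|D) * P(D|Q)."""
    # Turn log query likelihoods into normalized document weights.
    top = max(doc_scores.values())
    post = {d: math.exp(s - top) for d, s in doc_scores.items()}
    norm = sum(post.values())
    p_w = Counter()
    for d, tokens in feedback_docs.items():
        if not tokens:
            continue  # an empty document contributes nothing
        for w, c in Counter(tokens).items():
            p_w[w] += (c / len(tokens)) * (post[d] / norm)
    return p_w.most_common(num_terms)

def expanded_query(original_terms, expansion, orig_weight=0.5):
    """Interpolate the original query with the weighted expansion terms."""
    exp = " ".join("%.4f %s" % (p, w) for w, p in expansion)
    return "#weight( %.2f #combine( %s ) %.2f #weight( %s ) )" % (
        orig_weight, " ".join(original_terms), 1.0 - orig_weight, exp)

feedback = {"d1": ["gov2", "crawl"], "d2": ["gov2", "web"]}
scores = {"d1": -4.0, "d2": -4.0}  # hypothetical log query likelihoods
rm = relevance_model(feedback, scores)
print(expanded_query(["terabyte"], rm))
```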

We found that query expansion on top of a bag of words
model helped significantly. Moreover, when it was
used on top of a strong term proximity formulation, the
improvement was amplified. That is, we observed an
additive effect, rather than a dampening one.

We provide results in Table 1 for two query expansion
techniques that are built upon the dependence model
framework. The QE-LM approach uses language modeling
features, as described in [6]. The QE-BM25 approach
uses analogous BM25 features. Interestingly, the LM
features seem to outperform the BM25 features on the 2004
and 2005 topics, but not the 2006 topics. More analysis
must be done to understand this phenomenon better.

3.1.3  Document  Quality  Prior 

Finally, in terms of what helped, we saw mixed results
when using a document quality prior. The prior helped
when used in conjunction with a bag of words approach,
but actually hurt when used with a more complex
formulation that used term proximity and query expansion.

The document quality prior is based on two features
that aim to measure how likely the document is to contain
useful text content. It was meant to significantly penalize
documents that only consist of tables, Java applets, and
images, as these documents are unlikely to be relevant to
ad hoc query requests. See [8] and [11] for more details.

                        QL(T)   DM-LM(T)  QE-LM(T)  QE-BM25(T)  QE-LM(TDN)
2004 Topics (701-750)   0.2870  0.3067    0.3326    0.3216      0.3650
2005 Topics (751-800)   0.3432  0.3632    0.4002    0.3878      0.4287
2006 Topics (801-850)   0.3071  0.3444    0.3452    0.3687      0.4252
All Topics (701-850)    0.3126  0.3383    0.3595    0.3596      0.4065

Table 1: Ad hoc task mean average precision values for various retrieval strategies. QL represents query likelihood,
DM-LM is dependence models with language modeling features, QE-LM is dependence model query expansion using
LM features, and QE-BM25 is dependence model query expansion using BM25 features. In addition, T indicates
a run that only makes use of the title field of the TREC topic, while TDN indicates a run that makes use of the title,
description, and narrative fields. All runs are automatic.

3.2  What  Did  Not  Work 

3.2.1  Statistical  Phrases  and  WordNet 

Although improvements were achieved using term
proximity features, as described previously, it took a great deal
of effort and experimentation to get to that point. The first
set of failed experiments focused primarily on statistical
phrases and WordNet.

A statistical phrase dictionary was built. Then, given
a query, if any subphrase within the query was found in
the statistical phrase dictionary it was added as an exact
phrase to the query. Various formulations were tried,
including removing original query terms if they occurred in a
statistical phrase and trading off weight between query
terms and statistical phrases, among others. However, none
of the experiments that were tried improved upon the bag
of words baseline. Similar experiments were carried out
using WordNet to automatically construct synonyms of
terms and phrases. Such formulations performed even
worse than the ones done using statistical phrases.
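
For concreteness, the simplest of these phrase formulations can be
sketched as follows. The function name, the maximum phrase length,
and the representation of the dictionary as a set of term tuples are
assumptions; the sketch shows only the variant that appends matching
subphrases as exact phrases, which, as noted above, did not improve
on the bag of words baseline.

```python
def add_statistical_phrases(terms, phrase_dict, max_len=3):
    """Append every query subphrase found in the dictionary as #1(...)."""
    phrases = []
    for n in range(2, max_len + 1):
        for i in range(len(terms) - n + 1):
            cand = tuple(terms[i:i + n])
            if cand in phrase_dict:
                phrases.append("#1( " + " ".join(cand) + " )")
    return "#combine( " + " ".join(list(terms) + phrases) + " )"

# Hypothetical dictionary entry, for illustration only.
phrase_dict = {("new", "york")}
print(add_statistical_phrases(["new", "york", "hotels"], phrase_dict))
```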

3.2.2  Document  Structure 

As observed at past TREC Web Tracks [2], we found no
use for document structure. Various formulations were
tried, such as our named page formulation (described
below), but with no success. In all of our experiments, the
optimal parameter settings found gave zero weight to all
fields except the main body of the document.


3.3  What  Could  Work 

One promising area of future work is using an external
resource as a source of query expansion terms. It was
recently shown that using the web as an external resource
significantly improves ad hoc retrieval effectiveness on the
WT10g web collection [1].

Several preliminary experiments using the web as the
source of expansion terms were done on the GOV2
collection and found to significantly improve effectiveness
when compared to using GOV2 as the source of query
expansion terms. However, due to time constraints, no
further experiments were done. It is likely that a
combination of term proximity and query expansion against
the web will yield even better effectiveness than that
achieved in Table 1.


4  Named  Page  Finding  Task 

For the named page finding task, our strategy was to
construct queries that utilize Indri’s document structure and
document prior probability capabilities.

4.1  What  Worked 

4.1.1  Structure,  Link  Analysis,  Priors 

We found that the de facto best-practice named page
finding formulation, which includes using document
structure [9], link analysis, and document priors [3], continued
to work well on the 2005 and 2006 named page finding
topics. Our system made use of all three techniques.


                          QL-MM   DM-MM
2005 Topics (601-872)     0.4143  0.4405
2006 Topics (901-1081)    0.4980  0.5123
All Topics                0.4493  0.4705

Table 2: Named page finding task mean reciprocal rank
values for various retrieval strategies. Here, QL-MM
represents a mixture of unigram language models approach
and DM-MM represents a mixture of dependence
models approach. In both approaches, the mixing models
are estimated from the title, heading, anchor, and
mainbody representations of a web page. In addition,
inlink and PageRank information are incorporated in the
form of a document prior probability in both models.
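
The unigram mixture formulation with a document prior can be
sketched as follows. This is an illustration under stated
assumptions, not the scoring code we ran: the function name, the
mixture weights, the Dirichlet parameter mu, the floor used for
unseen collection probabilities, and the way the prior is supplied
as a single log probability are all hypothetical.

```python
import math

def field_mixture_score(query_terms, doc_fields, coll_fields, weights,
                        log_prior=0.0, mu=100.0):
    """Log prior plus log of a per-term mixture of field language models."""
    score = log_prior
    for t in query_terms:
        p = 0.0
        for field, w in weights.items():
            tokens = doc_fields.get(field, [])
            # Assumption: small floor for terms unseen in the collection field.
            p_coll = coll_fields[field].get(t, 1e-6)
            # Dirichlet-smoothed estimate of P(t | field of this document).
            p += w * (tokens.count(t) + mu * p_coll) / (len(tokens) + mu)
        score += math.log(p)
    return score

# Hypothetical weights over the four representations named in Table 2.
weights = {"title": 0.3, "heading": 0.1, "anchor": 0.3, "mainbody": 0.3}
coll = {f: {"indri": 0.001, "home": 0.01} for f in weights}
doc_a = {"title": ["indri", "home"], "mainbody": ["welcome", "to", "indri"]}
doc_b = {"title": ["lemur"], "mainbody": ["welcome", "to", "lemur"]}
query = ["indri", "home"]
print(field_mixture_score(query, doc_a, coll, weights) >
      field_mixture_score(query, doc_b, coll, weights))
```

In the DM-MM variant, the same mixture would be applied to the
phrase and window features as well as to the single terms.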


4.1.2  Term  Proximity 

As  with  the  ad  hoc  task,  we  also  found  term  proximity 
models  to  be  effective  for  the  named  page  finding  task. 
Although  the  two  tasks  are  fundamentally  different,  it  is 
interesting  to  see  that  term  proximity  improves  both. 

Table 2 shows our named page finding results from
2005 and 2006. For this year’s track, we used the same
general query formulation as last year. For more details
see [8]. The table compares a formulation that makes use
of a mixture of unigram language models with one that
uses a mixture of term dependence models. Both
formulations also include document priors. As the results show,
the formulation that uses term proximity improves MRR
by 4.7% over the unigram formulation.

4.2  What  Did  Not  Work 

Our  primary  focus  was  the  ad  hoc  task,  and  therefore  we 
did  not  investigate  many  alternative  query  formulations 
for  the  named  page  finding  task.  For  this  reason,  there 
was  no  single  method  that  we  tried  that  did  not  work. 

4.3  What  Could  Work 

Some preliminary experiments were done using
multiple-Bernoulli language models in place of multinomial
models. The rationale here is that the title field often
contains very little, if any, text, and therefore term frequency,
which is an important aspect of the multinomial model,
is much less important, and may be modeled better by a
multiple-Bernoulli distribution. This is an interesting area
of future work.